7 - Binary Classification with Support Vector Machines
By Patrick Nichols, Bobbie-Jo Webb-Robertson, and Christopher Oehmen, Pacific Northwest National Laboratory
Edited by Ian Gorton and Deborah K. Gracio
Book: Data-Intensive Computing
Published online: 05 December 2012
Print publication: 29 October 2012, pp. 157-179

Summary
Introduction
Support vector machines (SVMs) are currently among the most popular and accurate methods for binary data classification and prediction. They have been applied to a wide variety of data and problems, including cyber-security, bioinformatics, web search, medical risk assessment, and financial analysis [1]. This type of machine learning has been shown to be accurate and able to generalize predictions from previously learned patterns. However, current implementations are limited: they can typically be trained accurately only on example sets numbering in the tens of thousands, and they usually run only on serial computers. There are exceptions. A prime example is the annual machine learning and classification competitions, such as those held at the International Conference on Artificial Neural Networks (ICANN), which present problems with more than 100,000 elements to be classified. To treat such large test cases, however, the formalism of the support vector machine must be modified.
SVMs were first developed by Vapnik and collaborators [2] as an extension of neural networks. Assume that the data values associated with an entity can be converted into numerical values that form a vector in the mathematical sense; these vectors form a space. Assume further that this space of vectors can be separated by a hyperplane into the vectors that belong to one class and those that belong to the opposing class.
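The separating-hyperplane idea above can be illustrated with a minimal sketch (an assumed illustration, not the chapter's own code): given a hyperplane defined by a weight vector w and offset b, a point x is assigned to one class or the other according to the sign of w·x + b.

```python
# Minimal sketch of hyperplane-based classification (illustrative only;
# the weights here are chosen by hand, not learned as an SVM would do).

def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane
    w·x + b = 0 the point x falls on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

# Toy hyperplane x1 + x2 - 1 = 0 separating two classes of 2-D points.
w, b = [1.0, 1.0], -1.0
print(classify(w, b, [2.0, 2.0]))  # prints 1  (above the hyperplane)
print(classify(w, b, [0.0, 0.0]))  # prints -1 (below the hyperplane)
```

Training an SVM amounts to choosing the w and b that separate the two classes with the maximum margin; the hand-picked values above stand in for that learned result.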
9 - Let the Data Do the Talking: Hypothesis Discovery from Large-Scale Data Sets in Real Time
By Christopher Oehmen and Scott Dowson (Pacific Northwest National Laboratory), Wes Hatley (Future Point Systems), and Justin Almquist, Bobbie-Jo Webb-Robertson, Jason McDermott, Ian Gorton, and Lee Ann McCue (Pacific Northwest National Laboratory)
Edited by Ian Gorton and Deborah K. Gracio
Book: Data-Intensive Computing
Published online: 05 December 2012
Print publication: 29 October 2012, pp. 235-257

Summary
Discovering Biological Mechanisms through Exploration
The availability of massive amounts of data in the biological sciences is forcing us to rethink the role of hypothesis-driven investigation in modern research. Soon thousands, if not millions, of whole-genome DNA and protein sequence data sets will be available, thanks to continued improvements in high-throughput sequencing and analysis technologies. At the same time, high-throughput experimental platforms for gene expression, protein and protein fragment measurements, and other assays are driving experimental data sets to extreme scales. As a result, the biological sciences are undergoing a paradigm shift from hypothesis-driven to data-driven scientific exploration.

In hypothesis-driven research, one begins with observations, formulates a hypothesis, then tests that hypothesis in controlled experiments. In a data-rich environment, however, one often begins with only a cursory hypothesis (such as "some class of molecular components is related to a cellular process") that may require rapidly evaluating hundreds or thousands of specific hypotheses. Performing this many experiments physically is generally intractable. Often, however, existing data can be brought to bear to rapidly evaluate and refine these candidate hypotheses into a small number of testable ones. Moreover, the amount of data required to discover and refine a hypothesis in this way often overwhelms conventional analysis software and hardware. Advanced hardware can help, but conventional batch-mode access models for high-performance computing are not amenable to real-time analysis in larger workflows. We present a model for a real-time, data-intensive hypothesis discovery process that unites parallel software applications, high-performance hardware, and visual representation of the output.